Navigate the complexities of missing data in your datasets with this comprehensive guide to Python Pandas. Learn the essential imputation and deletion techniques, presented for a global audience.
Mastering Python Pandas Data Cleaning: A Global Guide to Missing Value Handling
In data analysis and machine learning, data quality is paramount. One of the most pervasive challenges is the presence of missing values, which can arise from many sources, such as data entry errors, sensor malfunctions, or incomplete surveys. Handling missing data effectively is a critical step in the data cleaning process, and it helps keep analyses robust and models accurate. This guide, written for a global audience, explains the essential techniques for managing missing values with the powerful Python Pandas library.
Why is Handling Missing Values So Crucial?
Missing data can significantly skew results. Many analytical algorithms and statistical models are not designed to handle missing values, which leads to errors or biased results. For example:
- Biased Averages: If missing values are concentrated in particular groups, a computed average can misrepresent the true characteristics of the population.
- Reduced Sample Size: Simply deleting rows or columns that contain missing values can shrink the dataset drastically, losing valuable information and statistical power.
- Model Performance Degradation: Machine learning models trained on incomplete data may show weaker predictive performance and poorer generalization.
- Misleading Visualizations: Charts and graphs can paint an inaccurate picture when missing data points are not accounted for.
Understanding and addressing missing values is a fundamental skill for every data professional, regardless of geographic location or industry.
Identifying Missing Values in Pandas
Pandas provides intuitive methods for detecting missing data. The main representations of a missing value are NaN (Not a Number) for numeric data and None for object dtypes. Pandas treats both as missing.
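As a small illustration of the point above, the following sketch checks how Pandas reports a few missing-value markers. The Series names are made up for this example, and pd.NaT (the datetime counterpart of NaN) is an addition that the text above does not mention.

```python
import numpy as np
import pandas as pd

# Hypothetical Series, one per dtype, each with one missing entry
numeric = pd.Series([1.0, np.nan, 3.0])                  # float64: NaN marks the gap
text = pd.Series(['a', None, 'c'], dtype=object)         # object: None marks the gap
dates = pd.Series(pd.to_datetime(['2024-01-01', None]))  # datetime64: None becomes NaT

print(numeric.isnull().tolist())  # [False, True, False]
print(text.isnull().tolist())     # [False, True, False]
print(dates.isnull().tolist())    # [False, True]

# None placed in an otherwise numeric list is converted to NaN on construction
mixed = pd.Series([1, None, 3])
print(mixed.dtype)                # float64
```

Note that the integer list was upcast to float64 because NaN is a floating-point value.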
The isnull() and notnull() Methods
The isnull() method returns a boolean DataFrame of the same shape, with True where a value is missing and False otherwise. Conversely, notnull() returns True for values that are not missing.
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {'col1': [1, 2, np.nan, 4, 5],
        'col2': [np.nan, 'b', 'c', 'd', 'e'],
        'col3': [6, 7, 8, np.nan, 10]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\nChecking for null values:")
print(df.isnull())
print("\nChecking for non-null values:")
print(df.notnull())
Counting Missing Values
To get a per-column summary of missing values, chain isnull() with the sum() method.
print("\nNumber of missing values per column:")
print(df.isnull().sum())
This output shows exactly how many missing entries each column contains, giving a quick overview of the extent of the problem.
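Raw counts are harder to compare across datasets of different sizes, so it can help to look at proportions as well. The sketch below is one possible way to do that; it rebuilds the same sample data under the hypothetical name df_example so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same sample data as the DataFrame used above
df_example = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5],
                           'col2': [np.nan, 'b', 'c', 'd', 'e'],
                           'col3': [6, 7, 8, np.nan, 10]})

# Share of missing entries per column, as a percentage
missing_pct = df_example.isnull().mean() * 100
print(missing_pct)

# Number of rows that contain at least one missing value
rows_with_missing = int(df_example.isnull().any(axis=1).sum())
print(rows_with_missing)  # 3

# Total number of missing cells in the whole DataFrame
total_missing = int(df_example.isnull().sum().sum())
print(total_missing)      # 3
```

The mean of a boolean column is the fraction of True values, which is why isnull().mean() gives the missing share directly.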
Visualizing Missing Data
For large datasets, visualizing missing data can be very insightful. Libraries such as missingno help identify patterns of missingness.
# You might need to install this library:
# pip install missingno
import missingno as msno
import matplotlib.pyplot as plt
print("\nVisualizing missing data:")
msno.matrix(df)
plt.title("Missing Data Matrix")
plt.show()
In the matrix plot, each column appears as a vertical bar that is filled where data is present and blank where it is missing. This makes it easier to see whether the missingness looks random or follows a pattern.
Strategies for Handling Missing Values
There are several common strategies for handling missing data. The right choice usually depends on the nature of the data, the proportion of missing values, and the goals of the analysis.
1. Deletion Strategies
Deletion means removing the data points that have missing values. It looks simple, but it is important to understand its implications.
a. Row Deletion (Listwise Deletion)
This is the simplest approach: remove every row that contains at least one missing value.
print("\nDataFrame after dropping rows with any missing values:")
df_dropped_rows = df.dropna()
print(df_dropped_rows)
Pros: Simple to implement, and it yields a clean dataset for algorithms that cannot handle missing values.
Cons: Can shrink the dataset considerably. If the data are not Missing Completely At Random (MCAR), valuable information may be lost and bias may be introduced.
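Row deletion does not have to be all-or-nothing. As a rough sketch, the how, subset, and thresh parameters of dropna() let you remove fewer rows; the sample data and variable names below are made up for illustration.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'col1': [1, 2, np.nan, 4, np.nan],
                           'col2': [np.nan, 'b', 'c', 'd', np.nan],
                           'col3': [6, 7, 8, np.nan, np.nan]})

# Drop a row only when every value in it is missing
dropped_all = df_example.dropna(how='all')

# Drop rows that are missing a value in one key column
dropped_subset = df_example.dropna(subset=['col1'])

# Keep only rows that have at least 2 non-missing values
dropped_thresh = df_example.dropna(thresh=2)

print(len(df_example), len(dropped_all), len(dropped_subset), len(dropped_thresh))
# 5 4 3 4
```

A plain dropna() on this data would keep just one row, so the narrower options can preserve noticeably more of the dataset.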
b. Column Deletion
If a particular column has a very high proportion of missing values and is not important to the analysis, you might consider dropping the entire column.
# Example: Drop 'col1' if it had too many missing values (hypothetically)
# For demonstration, let's create a scenario with more missing data in col1
data_high_missing = {'col1': [1, np.nan, np.nan, np.nan, 5],
                     'col2': [np.nan, 'b', 'c', 'd', 'e'],
                     'col3': [6, 7, 8, np.nan, 10]}
df_high_missing = pd.DataFrame(data_high_missing)
print("\nDataFrame with potentially high missingness in col1:")
print(df_high_missing)
print("\nMissing values per column:")
print(df_high_missing.isnull().sum())
# Let's say we decide to drop col1 due to high missingness
df_dropped_col = df_high_missing.drop('col1', axis=1) # axis=1 indicates dropping a column
print("\nDataFrame after dropping col1:")
print(df_dropped_col)
Pros: Effective when a column is of little use because so much of it is missing.
Cons: A potentially valuable feature may be lost. The threshold for "too many missing values" is subjective.
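Because the cutoff is subjective, it can help to state it explicitly in code rather than dropping columns by name. The following is a minimal sketch under that assumption; the 50% cutoff and the name max_missing_fraction are hypothetical choices, not recommendations.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'col1': [1, np.nan, np.nan, np.nan, 5],
                           'col2': [np.nan, 'b', 'c', 'd', 'e'],
                           'col3': [6, 7, 8, np.nan, 10]})

# Hypothetical cutoff: drop columns with more than 50% missing values
max_missing_fraction = 0.5

missing_fraction = df_example.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > max_missing_fraction].index.tolist()
df_reduced = df_example.drop(columns=cols_to_drop)

print(cols_to_drop)                 # ['col1']
print(df_reduced.columns.tolist())  # ['col2', 'col3']
```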
2. Imputation Strategies
Imputation means replacing missing values with estimated or calculated values. It is often preferred over deletion because it preserves the size of the dataset.
a. Mean/Median/Mode Imputation
This is a common and simple imputation technique. For a numeric column, you can replace missing values with the mean or median of the column's non-missing values. For a categorical column, the mode (the most frequent value) is used.
- Mean Imputation: Suitable for normally distributed data. Sensitive to outliers.
- Median Imputation: More robust to outliers than mean imputation.
- Mode Imputation: Used for categorical features.
# Using the original df with some NaN values
print("\nOriginal DataFrame for imputation:")
print(df)
# Impute missing values in 'col1' with the mean
# (assign the result back; inplace=True on a column selection is chained
# assignment, which recent pandas versions warn about)
mean_col1 = df['col1'].mean()
df['col1'] = df['col1'].fillna(mean_col1)
# Impute missing values in 'col3' with the median
median_col3 = df['col3'].median()
df['col3'] = df['col3'].fillna(median_col3)
# Impute missing values in 'col2' with the mode
mode_col2 = df['col2'].mode()[0] # mode() can return multiple values if there's a tie
df['col2'] = df['col2'].fillna(mode_col2)
print("\nDataFrame after mean/median/mode imputation:")
print(df)
Pros: Simple, and it preserves the size of the dataset.
Cons: Can distort the variance and covariance of the data. It assumes that the mean, median, or mode is a good stand-in for the missing values, which is not always true.
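The variance distortion mentioned above is easy to see on a toy example. The sketch below uses a hypothetical Series and compares the sample variance before and after mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries
s = pd.Series([10.0, 12.0, np.nan, 15.0, np.nan, 20.0])

# Variance of the observed values only (NaN is skipped by default)
var_observed = s.var()

# Fill the gaps with the mean, then recompute the variance
s_imputed = s.fillna(s.mean())
var_imputed = s_imputed.var()

print(round(var_observed, 2))  # 18.92
print(round(var_imputed, 2))   # 11.35
```

The mean stays at 14.25, but the variance shrinks because every imputed value sits exactly at the center of the distribution.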
b. Forward Fill and Backward Fill
These methods are especially useful for time-series data or data that has a natural order.
- Forward Fill (ffill): Fills a missing value with the last valid observation before it.
- Backward Fill (bfill): Fills a missing value with the next valid observation after it.
# Recreate a DataFrame with missing values suitable for ffill/bfill
data_time_series = {'value': [10, 12, np.nan, 15, np.nan, np.nan, 20]}
df_ts = pd.DataFrame(data_time_series)
print("\nOriginal DataFrame for time-series imputation:")
print(df_ts)
# Forward fill (fillna(method='ffill') is deprecated in recent pandas; use ffill())
df_ts_ffill = df_ts.ffill()
print("\nDataFrame after forward fill:")
print(df_ts_ffill)
# Backward fill (likewise, use bfill() instead of fillna(method='bfill'))
df_ts_bfill = df_ts.bfill()
print("\nDataFrame after backward fill:")
print(df_ts_bfill)
Pros: Useful for ordered data, and it preserves temporal relationships.
Cons: Can propagate stale or incorrect values across long gaps of missing data. ffill ignores future information, and bfill ignores past information.
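One way to reduce the risk of propagating values across long gaps is the limit parameter of ffill(), sketched below on hypothetical data. The sketch also shows interpolate(), a technique not covered above, only as a point of comparison, since it draws on both the previous and the next observation.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'value': [10, 12, np.nan, 15, np.nan, np.nan, np.nan, 20]})

# Forward fill, but carry each observation forward by at most one row
filled_limited = df_example['value'].ffill(limit=1)
print(filled_limited.tolist())
# [10.0, 12.0, 12.0, 15.0, 15.0, nan, nan, 20.0]

# Linear interpolation uses the neighbours on both sides of a gap
interpolated = df_example['value'].interpolate(method='linear')
print(interpolated.tolist())
# [10.0, 12.0, 13.5, 15.0, 16.25, 17.5, 18.75, 20.0]
```

With limit=1, the remaining NaN values stay visible, so a long gap can be handled deliberately instead of being filled silently.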
c. Imputation using Groupby
A more refined approach is to impute missing values from group statistics. This is especially useful when the missingness is suspected to be related to particular categories or groups in the data.
data_grouped = {
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'value': [10, 20, np.nan, 25, 15, 30, 12, np.nan]
}
df_grouped = pd.DataFrame(data_grouped)
print("\nOriginal DataFrame for grouped imputation:")
print(df_grouped)
# Impute missing 'value' based on the mean 'value' of each 'category'
df_grouped['value'] = df_grouped.groupby('category')['value'].transform(lambda x: x.fillna(x.mean()))
print("\nDataFrame after grouped mean imputation:")
print(df_grouped)
Pros: Accounts for variation between groups, and it often gives more accurate imputations than a global mean, median, or mode.
Cons: Requires a relevant grouping variable. Can be computationally expensive on very large datasets.
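A possible variation on grouped imputation is sketched below. It assumes the built-in 'mean' aggregation is acceptable in place of a Python lambda, which tends to be cheaper on large data, and it adds a fallback for groups with no observed values. The data and the column name value_imputed are hypothetical.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'C', 'C'],
    'value': [10.0, 20.0, np.nan, np.nan, np.nan, np.nan],
})

# Per-row group mean, computed with the built-in 'mean' aggregation
group_mean = df_example.groupby('category')['value'].transform('mean')

# Fill from the group mean first, then fall back to the overall mean
# for groups that have no observed values at all (category 'C' here)
overall_mean = df_example['value'].mean()
df_example['value_imputed'] = df_example['value'].fillna(group_mean).fillna(overall_mean)

print(df_example['value_imputed'].tolist())
# [10.0, 20.0, 10.0, 20.0, 15.0, 15.0]
```

Without the fallback, the rows in category 'C' would remain NaN, because a group with no observed values has no mean.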
d. More Advanced Imputation Techniques
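For more complex scenarios, especially in machine learning pipelines, consider these more advanced methods:
- K-Nearest Neighbors (KNN) Imputer: Imputes a missing value from the values of the K nearest neighbors found in the training set.
- Iterative Imputer (e.g., using MICE - Multiple Imputation by Chained Equations): Models each feature that has missing values as a function of the other features and refines the estimates over repeated rounds. In Scikit-learn, IterativeImputer uses Bayesian ridge regression as its default estimator.
- Regression Imputation: Predicts missing values with a regression model.
These methods are generally available in libraries such as Scikit-learn.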
# Example using Scikit-learn's KNNImputer
from sklearn.impute import KNNImputer
# KNNImputer works on numerical data. We'll use a sample numerical DataFrame.
data_knn = {'A': [1, 2, np.nan, 4, 5],
            'B': [np.nan, 20, 30, 40, 50],
            'C': [100, np.nan, 300, 400, 500]}
df_knn = pd.DataFrame(data_knn)
print("\nOriginal DataFrame for KNN imputation:")
print(df_knn)
imputer = KNNImputer(n_neighbors=2) # Use 2 nearest neighbors
df_knn_imputed_arr = imputer.fit_transform(df_knn)
df_knn_imputed = pd.DataFrame(df_knn_imputed_arr, columns=df_knn.columns)
print("\nDataFrame after KNN imputation:")
print(df_knn_imputed)
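Pros: Can give more accurate imputations by taking the relationships between features into account.
Cons: Computationally expensive, requires careful implementation, and depends on the assumed relationships between features actually holding.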
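For comparison with the KNN example, here is a minimal sketch of the iterative approach using Scikit-learn's IterativeImputer on the same kind of numeric data. The settings max_iter=10 and random_state=0 are arbitrary choices for illustration. IterativeImputer is still marked experimental in Scikit-learn, so it has to be enabled with an explicit import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the next import)
from sklearn.impute import IterativeImputer

df_example = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                           'B': [np.nan, 20, 30, 40, 50],
                           'C': [100, np.nan, 300, 400, 500]})

# Each column with gaps is modelled from the other columns, round by round
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed_arr = imputer.fit_transform(df_example)

# fit_transform returns a NumPy array, so restore the labels
df_iter = pd.DataFrame(imputed_arr, columns=df_example.columns, index=df_example.index)
print(df_iter.round(2))
```

In a modeling pipeline, it is usually safer to fit the imputer on the training split only and then call transform() on the test split, so that no information from the test data leaks into the imputation.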
Handling Missing Values in Categorical Data
Categorical data brings its own challenges. Mode imputation is common, but other strategies are also effective:
- Mode Imputation: As described above, fill with the most frequent category.
- Creating a New Category: Treat missing values as a separate category (e.g., "Unknown" or "Missing"). This is useful when the fact that a value is missing is itself informative.
- Imputation based on other features: If the categorical feature has a strong relationship with other features, a classifier can be used to predict the missing category.
data_cat = {'Product': ['A', 'B', 'A', 'C', 'B', 'A', np.nan],
            'Region': ['North', 'South', 'East', 'West', 'North', np.nan, 'East']}
df_cat = pd.DataFrame(data_cat)
print("\nOriginal DataFrame for categorical handling:")
print(df_cat)
# Strategy 1: Mode imputation for 'Region'
# (on a tie, mode() returns the tied values sorted, so [0] picks the first alphabetically)
mode_region = df_cat['Region'].mode()[0]
df_cat['Region'] = df_cat['Region'].fillna(mode_region)
# Strategy 2: Create a new category for 'Product'
df_cat['Product'] = df_cat['Product'].fillna('Unknown')
print("\nDataFrame after categorical imputation:")
print(df_cat)
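When missingness itself may carry information, another option is to keep a separate indicator column and still impute the original one. The sketch below shows one way this might look; the column name Region_was_missing is a hypothetical choice.

```python
import pandas as pd

df_example = pd.DataFrame({'Region': ['North', 'South', None, 'West', 'North', None]})

# Record which rows were missing before any filling happens
df_example['Region_was_missing'] = df_example['Region'].isnull()

# Fill the gaps with the mode; the indicator column preserves the original signal
mode_region = df_example['Region'].mode()[0]
df_example['Region'] = df_example['Region'].fillna(mode_region)

print(df_example)
```

The indicator must be created before the fill, because afterwards the imputed rows can no longer be told apart from observed ones.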
Best Practices and Considerations for a Global Audience
When you work with data from diverse sources, or data intended for a global audience, keep the following in mind:
- Understand the Data Source: Why are values missing? Is there a systematic problem with data collection in a particular region or on a particular platform? Knowing the origin can guide your strategy. For example, if a survey platform consistently fails to capture a specific demographic in a specific country, that missingness is probably not random.
- Context is Key: The "right" way to handle missing values depends on context. A financial model may need meticulous imputation to avoid even slight bias, while a quick exploratory analysis may be fine with a simpler method.
- Cultural Nuances in Data: Data collection practices can differ across cultures. For example, how "income" is reported, or whether "not applicable" is a common answer, can vary. This can affect how missing values should be interpreted and handled.
- Time Zones and Data Lag: For time-series data that spans different time zones, make sure the data is standardized (e.g., to UTC) before applying time-based imputation methods such as ffill/bfill.
- Currency and Units: When imputing numeric values that involve different currencies or units, check for consistency or apply the appropriate conversion before imputation.
- Document Your Decisions: Always document the methods you used to handle missing data. This transparency matters for reproducibility and helps others understand your analysis.
- Iterative Process: Data cleaning, including missing value handling, is often iterative. You may try one method, evaluate its impact, and then refine your approach.
- Use Libraries Wisely: Pandas is the primary tool, but Scikit-learn is very useful for more complex imputation. Choose the right tool for the job.
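To make the time zone point concrete, here is a rough sketch with two hypothetical sources that log local timestamps. Both are converted to UTC before sorting and forward filling; all names and readings are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings logged in two local time zones
tokyo = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-01 09:00', '2024-01-01 11:00']).tz_localize('Asia/Tokyo'),
    'value': [1.0, np.nan],
})
london = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-01 01:00', '2024-01-01 03:00']).tz_localize('Europe/London'),
    'value': [2.0, 4.0],
})

# Convert both to UTC so that row order reflects true chronological order
for frame in (tokyo, london):
    frame['timestamp'] = frame['timestamp'].dt.tz_convert('UTC')

combined = pd.concat([tokyo, london]).sort_values('timestamp').reset_index(drop=True)
combined['value'] = combined['value'].ffill()

print(combined['value'].tolist())  # [1.0, 2.0, 2.0, 4.0]
```

Here the missing Tokyo reading (02:00 UTC) is filled from the London reading at 01:00 UTC. Sorting by local wall-clock time instead would have placed it after the 09:00 Tokyo row and filled it with 1.0.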
Conclusion
Missing values are an unavoidable part of working with real-world data. Python Pandas offers a flexible and powerful toolset for identifying, analyzing, and handling these missing entries. Whether you choose deletion or imputation, each method has its own trade-offs. By understanding these techniques and considering the global context of your data, you can substantially improve the quality and reliability of your data analyses and machine learning models. Mastering these data cleaning skills is a foundation for becoming an effective data professional anywhere in the world.
Key Takeaways:
- Identify: Use df.isnull().sum() and visualization.
- Delete: Use dropna() judiciously, with the resulting data loss in mind.
- Impute: Use fillna() with the mean, median, or mode, use ffill or bfill, or turn to more advanced techniques from Scikit-learn.
- Context Matters: The best strategy depends on your data and your goals.
- Global Awareness: Consider cultural nuances and the origin of your data.
Keep practicing these techniques to build a strong foundation for robust data science workflows.